Metadata of the chapter that will be visualized in SpringerLink

Book Title: Language Production, Cognition, and the Lexicon
Series Title: 6636
Chapter Title: Quo Vadis: A Corpus of Entities and Relations
Copyright Year: 2014
Copyright Holder: Springer International Publishing Switzerland

Corresponding Author: Dan Cristea. Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iaşi, Iaşi, Romania; Institute for Computer Science, Romanian Academy - The Iaşi Branch, Iaşi, Romania. Email: dcristea@info.uaic.ro
Author: Daniela Gîfu. Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iaşi, Iaşi, Romania. Email: daniela.gifu@info.uaic.ro
Author: Mihaela Colhon. Department of Computer Science, University of Craiova, Craiova, Romania. Email: mcolhon@inf.ucv.ro
Author: Paul Diac. Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iaşi, Iaşi, Romania. Email: paul.diac@info.uaic.ro
Author: Anca-Diana Bibiri. Department of Interdisciplinary Research in Social-Human Sciences, "Alexandru Ioan Cuza" University of Iaşi, Iaşi, Romania. Email: anca.bibiri@info.uaic.ro
Author: Cătălina Mărănduc. "Iorgu Iordan-Al. Rosetti" Institute of Linguistics of the Romanian Academy, Bucharest, Romania. Email: catalina.maranduc@yahoo.com
Author: Liviu-Andrei Scutelnicu. Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iaşi, Iaşi, Romania; Institute for Computer Science, Romanian Academy - The Iaşi Branch, Iaşi, Romania. Email: liviu.scutelnicu@info.uaic.ro

Abstract
This chapter describes a collective work aimed at building a corpus that includes annotations of semantic relations on a text belonging to the belletristic genre. The paper presents annotation conventions for four categories of semantic relations and the process of building the corpus as a collaborative work. Part of the annotation, such as the token/part-of-speech/lemma layer, is done automatically during a preprocessing phase. Then an entity layer (where entities of type person are marked) and a relation layer (evidencing binary relations between entities) are added manually by a team of trained annotators, the result being a heavily annotated file. A number of methods to obtain accuracy are detailed. Finally, some statistics over the corpus are drawn. The language under investigation is Romanian, but the proposed annotation conventions and methodological hints are applicable to any language and text genre.

Keywords: Semantic relations - Annotated corpus - Anaphora - XML - Annotation conventions

Layout: T1 Standard SC, Book ID: 323543_1_En, Book ISBN: 978-3-319-08042-0, Chapter No.: 28, Date: 27-6-2014, Time: 7:45 pm

N. Gala et al. (eds.), Language Production, Cognition, and the Lexicon, Text, Speech and Language Technology 48, DOI 10.1007/978-3-319-08043-7_28, Springer International Publishing Switzerland 2014

1 Introduction

When we read books we are able to discover in the sequence of words, with apparently zero effort, the mentions of entities, the relationships that connect entities, as well as the events and situations the characters are involved in. Entities, relations, events and situations can be of many types. For instance, entities could be: persons, animals, places, organisations, crafts, objects, ideas, etc.,
as well as any grouping of the above. Relations linking characters could be anaphoric (when the interpretation of one mention is dependent on the interpretation of a previous one), affectional (when a certain feeling or emotion is expressed in the interaction of characters), kinship (when family relationships are mentioned, sometimes composing very complex genealogical trees), social (when job hierarchies or mutual social ranks are explicitly remarked), etc. Moreover, the text could include mentions of relations holding between characters and other types of entities: persons are in places, persons belong to organisations, locations are mutually positioned in space, etc. Deciphering different types of links is a major step in understanding the content of a book.

We address in this paper the issue of building a corpus that makes explicit a strictly delimited set of binary relations between entities that belong to the following types: persons, gods, groups of persons and gods, parts of bodies of persons and gods. The relations marked in the corpus belong to four binary types: anaphoric, affectional, kinship and social.

Very often the interpretation of semantic relations is subjective, being the result of a personal interpretation of the text, more precisely, of inferences developed by the reader or obtained by bringing into play some extra-textual general knowledge. In building this corpus we avoided anything that is not explicitly uttered in the text, thus trying to keep the subjective interpretation to a minimum. Also, we were not interested in decoding time moments, nor the different ways in which time could be connected to entities or to the events they participate in.

The motivation for this endeavour is to base on this "gold" corpus the construction of a technology able to recognise these types of entities and these types of semantic relations in free texts. The experience acquired while building such a technology
could then be applied in extrapolating it to other types of entities and semantic relations, finally arriving at a technology able to decipher the semantic content of texts. When a human reads a text, she/he first deciphers the semantic links as they are mentioned in the text, and only then are these isolated pieces of knowledge connected by inferences that often engage particular knowledge of the reader, thus obtaining a coherent, albeit personal, interpretation of the text as a whole. A novel could be the source of as many interpretations as it has readers. The freedom to build one's own interpretation is what most readers relish. The issue of building a machine interpretation of a novel should therefore be considered also from this perspective (which could be a rich source of philosophical questions, too): how much do we want our technology to be able to add above the strict level of information communicated by the text? We believe that the answer to this question should be rooted in the actual type of application that is desired, but the primary challenge is this one: are we capable of interpreting the basic assertions expressed in a book?
We are tempted to conclude this introduction by making a parallel with the way human beings grow up, the accumulation of their memories and readings assembling in time their personalities, characters, and predispositions for certain types of behaviour. Similarly, one may imagine the formation of intelligent behaviours in agents, which could be rooted in the artificial interpretation of books, this way short-circuiting a real life-time formation. Fictional worlds, if properly selected and fed into agents, are extremely rich in life examples, and a plurality of semantic aspects belonging to real or fictional lives could be studied if recognized in other contexts.

The paper is organised as follows. In the following section we give reasons for selecting the text of a novel as the basis for this corpus. Then we make a brief tour of the literature to outline the state of the art in work related to semantic relations. Section 4 presents the annotation conventions for entities and the four categories of relations annotated in the corpus. Then Sect. 5 details the activities for the creation of the corpus. The notations in the corpus allowed us to make different counts and comparisons, which are presented in Sect. 6. Finally, Sect. 7 makes concluding remarks and presents ways of exploiting the corpus.

2 Why a Corpus Displaying Semantic Relations in Free Texts?
To understand a language one needs not only means of expression, but also vehicles of thought, necessary to discover new ideas or clarify existing ones by refining, expanding, illustrating more or less well specified thoughts (Zock 2010). Semantic relations describe interactions, connections. Connections are indispensable for the interpretation of texts. Without them, we would not be able to express any continuous thought, and we could only list a succession of images and ideas isolated from each other. Every non-trivial text describes a group of entities and the ways in which they interact or interrelate. Identifying these entities and the relations between them is a fundamental step in text understanding (Năstase et al. 2013). Moreover, relations form the basis for lexical organization. In order to create lexical resources, NLP researchers have proposed a variety of methods, including lexico-syntactic patterns and knowledge-based methods exploiting complex lexical resources. Somewhat similar situations occur when we reverse the task and go from text interpretation to text production. As Michael Zock and collaborators have shown (Zock et al. 2010; Zock and Schwab 2013), lexical resources are only truly useful if the words they contain are easily accessible. To allow for this, association-based indexes can be used to support interactive lexical access for language producers. As such, co-occurrences can encode word associations and, by this, links holding between them.

Developing a lexical-semantic knowledge base to be used in Natural Language Processing (NLP) applications is the main goal of the research described in this chapter. Such resources are not available for many languages, mainly because of the high cost of their construction. The knowledge base is built in such a way
as to facilitate the training of programs aiming to automatically recognise entities and semantic relations in text. As we will see in the next sections, it includes annotations for the spans of text that display mentions of entities and relations, for the arguments (poles) of the relations, as well as for the relevant words or expressions that signal relations.

In the research described in this paper we are concerned neither with the automatic recognition of entities nor with the recognition of the relationships between entities. Instead, we present how a significant corpus marking entities and relations has been built. The recognition problem will be our next objective. Năstase et al. (2013) show many examples of NLP tasks that are based on the ability to identify semantic relations in text, such as: information extraction, information retrieval, text summarization, machine translation, question answering, paraphrasing, recognition of textual entailment, construction of thesauri and of semantic networks, word-sense disambiguation, language modelling. Although the issue of automatic recognition of semantic relations has received good attention in the literature, we believe that the problem is far from being exhausted. Moreover, most of the research in this area is focused on corpora composed of press articles or on Wikipedia texts, while the corpus we describe is built on the skeleton of a fictional text.

As a support for our annotations we have chosen a novel, i.e. a species of the epic genre, which is particularly rich in semantic relations. The text used was the Romanian version of the novel "Quo Vadis", authored by the Nobel laureate Henryk Sienkiewicz.¹ In this masterpiece the author narrates an extremely complex society, the pre-Christian Rome of the time of the emperor Nero. The text displays inter-human relations of a considerable extent: love and hate, friendship and
enmity, socio-hierarchical relationships involving slaves and their masters, or courtiers and the emperor, etc. The story is dynamic, it involves many characters, and the relations are either stable (for instance, those describing family links) or change in time (sexual interest, in one case, and disgust, in another, both evolve into love; friendship degrades into hate; lack of understanding develops into worship). Appreciative affective relations may differ depending on whether they hold between human characters, when they take the form of love, or between humans and gods, when they take the shape of worship. Another aspect of interpretation relates to the contrast between the social and affective relationships at the court of Nero, dictatorial, based on fear and obedience, and those found in the evolving but poor Christian society, based on solidarity and forgiveness.

¹ The version is the one translated by Remus Luca and Elena Lintă and published at Tenzi Publishing House in 1991.

Annotating a novel for entities and the semantic relations between them represents a significant challenge, given the differences between journalistic and literary style, the first being devoid of the figures of speech that create suggestive images and emotional effects in the mind of the reader.

Not least, the novel "Quo Vadis" has been translated into a great many languages. We have considered the possibility of exploiting the semantic annotations of the Romanian version for other languages, by applying exporting techniques. Quite a few results in recent years have shown that, under certain conditions, annotations can be exported on parallel, word-aligned corpora, and this usually shortens the annotation time and reduces the costs (Postolache et al. 2006; Drabek and Yarowsky 2005).

Of course, the
annotation conventions are expressed in English to facilitate their use for similar tasks in other languages. As is usually the case with semantic labels, they are expected to be applicable without adaptation effort to any language. Most of the examples are extracted from the Romanian version of the book and then aligned passages are searched in an English version.² The examples are also meant to show the specificities of different language versions: syntactic features, but mainly idiosyncrasies of translation, make the notations pertinent to the two versions differ in certain cases. We make special notes when these cases occur.

² Translation in English by Jeremiah Curtin, published by Little Brown and Company in 1897.

3 Similar Work

Murphy (2003) reviews several properties of semantic relations, among them uncountability, which basically marks relations as an open class. If considered between words and not entities, then relations complement syntactic theories, like functional dependency grammars (Tesnière 1959) or HPSG (Pollard and Sag 1994). Taken to the extreme, we might say that the very meaning of words in context is constituted by the contextual relations they are in (Cruse 1986). Lyons (1977), another important scholar of the British structural semantic tradition, considers that (p. 443) "as far as the empirical investigation of the structure of language is concerned, the sense of a lexical item may be defined to be, not only dependent upon, but identical with, the set of relations which hold between the item in question and the other items in the same lexical system".

In NLP applications, as in this work, the use of semantic relations is generally understood in a narrower sense. We are concerned with relations between concepts or instances of
concepts, i.e. conceptualisations, representations on a semantic layer of persons, things, ideas, which should not be confused with relations between terms, i.e. words, expressions or signs, that are used to express these conceptualisations. For instance, novel, as a concept, should be distinguished from "novel", as an English word, or "roman", its Romanian equivalent. The concept novel may be expressed by the terms or expressions "novel", "fiction" or "fantasy writing". The relation between "novel" and "fiction" is a synonymy relation between two words. These words are synonyms because there exist contexts of use where they mean the same thing. Examples of lexical databases are WordNet (Miller et al. 1990), where lexical items are organized in synonymy sets (synsets), themselves representing concepts placed in a hierarchy, or Polymots (Gala et al. 2010), which reveals and capitalizes on the bidirectional links between the semantic characterization of lexical items and morphological analogies.

We place our work in the semantic register, where unambiguous meanings of words or expressions in context have been formed and what is looked for are the relations that the text expresses between these conceptualisations, or entities. The domain is known as entity linking.

One kind of semantic relation is the hyperonym relation, also called "is a" (and usually noted ISA), linking a hyponym to a hyperonym, or an instance to a class, in a hierarchical representation, for instance in a taxonomy or an ontology (A is a kind of B, A is subordinate to B, A is narrower than B, B is broader than A). Dictionaries usually use the Aristotelian pattern to define a definiendum by identifying a genus proximum and showing the differentiae specificae with respect to other instances of the same class (Del Gaudio 2014). Long ago, Quillian (1962) imagined a model of human memory, thus introducing the concept of semantic network, a knowledge graph
representation in which meaning is defined by labelled relations (connections) between concepts and their instances. The most common relations in these representations are the ISA relations (for example, "Petronius is a proconsul in Bithynia"), but other types include "part of", "has as part", etc.

Most of the work on semantic relations and entity linking addresses the recognition problem and, on a secondary scale, the issue of building significant corpora to support this activity. If NLP systems are to reach the goal of producing meaningful representations of text, they must attain the ability to detect entities and extract the relations which hold between them.

The term entity linking usually denotes the task of linking mentions of entities that occur in an unstructured text with records of a knowledge base (KB). As such, the most important challenges in entity linking are: name variations (different text strings in the source text refer to the same KB entity), ambiguities (there is more than one entity in the KB that a string can refer to) and absence (there is no entity description in the KB that a string in the source text representing an entity could possibly match) (Bunescu and Pasca 2006; Cucerzan 2007). Let us also note that in the interpretation of fictional texts, each time a process starts, the KB is empty, and it is populated synchronously with the unfolding of the text and the encountering of first mentions.

Another point of interest is the detection of relations holding between entities and events. Mulkar-Mehta et al. (2011), for instance, focused on recognising the part-whole relations between entities and events as well as causal relations of coarse and fine granularities. Bejan and Harabagiu (2010) and Chen et al. (2011) showed that
coreferences between events can be detected using a combination of lexical, part-of-speech (POS), semantic and syntactic features. Their corpus, the EventCorefBank, restricted by the Automatic Content Extraction exercise to the domains of Life, Business, Conflict and Justice, contained articles on 43 different topics from the Google News archive. Also, Cybulska and Vossen (2012) annotated the Intelligence Community Corpus with coreference mentions between violent events such as bombings, killings, wars, etc.

Corpora detailing semantic relations in fictional texts are rare, if not nonexistent. The texts usually supporting the activities of semantic analysis belong to the print press. Some have been annotated with predicate-argument structures and anaphoric relations, such as the Kyoto University Text Corpus (Kawahara et al. 2002) and the Naist Text Corpus (Iida et al. 2007). The anaphoric relations are categorized into three types (Masatsugu et al. 2012): coreference, annotated with the "=" tag, bridging reference, which can be expressed in the form B of A, annotated by "NO:A" to B, and non-coreference anaphoric relations, annotated with "^". The Balanced Corpus of Contemporary Written Japanese (BCCWJ)³ includes publications (books, magazines) and social media texts (blogs and forums) annotated with predicate-argument structures as defined in FrameNet (Ohara 2011). Although the predicate-argument structures of FrameNet include the existence of zero pronouns, referents are not annotated if they do not exist in the same sentence. Since anaphoric relations are not annotated, inter-sentence semantic relations are not annotated either.

Another type of annotation regarding the semantic level focusses specifically on anaphoric relations, including zero anaphora, as in the Live Memories Corpus (Rodríguez et al. 2010), originating in the Italian Wikipedia and blogs. Since in Italian
pronoun-dropping only occurs in the subject position (as in Romanian), they transfer to the corresponding predicates the role of anaphors. To the same category belongs the Z-corpus (Rello and Ilisei 2009), incorporating Spanish law books, textbooks and encyclopedia articles, treating zero anaphora. There too, pronoun-dropping is marked in the subject position.

³ http://www.tokuteicorpus.jp/

In all cases, corpora annotated for semantic links are intended to be used to train recognition algorithms. In principle, the annotation layers, the constraints used in the annotation, and the annotation conventions should be related to the set of features used by the recogniser and not to the learning methods used in training. For that reason, in the following we will briefly review algorithms and sets of features used in the training of semantic link detection.

As regards the methods used in recognition, some approaches use supervised machine learning to match entity mentions onto their corresponding KB records. Rao et al. (2012) score entities contained in the KB for a possible match to the query entity. Many studies make use of syntactic features extracted by deep or shallow parsing, POS-tagging and named entity annotation (Pantel et al. 2004). The 2012 Text Analysis Conference launched the Knowledge Base Population Cold Start task, requiring systems to take a set of documents and produce a comprehensive set of <Subject, Predicate, Object> triples that encode relationships involving named entities. Starting with (Hearst 1992), many studies have used patterns incorporating information at the lexical and syntactic level for the identification of instances of semantic relationships (Banko et al. 2007; Girju et al. 2006; Snow et al. 2006). The system KELVIN, for instance,
integrates a pipeline of processing tools, among which a basic tool is BBN's SERIF (Statistical Entity and Relation Information Finding). SERIF does named-entity identification and classification by type and subtype, intra-document co-reference analysis, including named, nominal and pronominal mentions, sentence parsing, in order to extract intra-sentential relations between entities, and detection of certain types of events (Boschee et al. 2005).

An accurate entity linking technique is dependent on a diversity of NLP mechanisms, which should work well in workflows. Ad hoc linking techniques belong to the class of one-document and cross-document anaphora resolution (Bagga and Baldwin 1998; Saggion 2007; Singh et al. 2011). RARE is a system of anaphora resolution relying on a mixed approach that combines symbolic rules with learning techniques (Cristea and Dima 2001; Postolache et al. 2006). A recently improved version of it, developed for the ATLAS project,⁴ has given good results for a number of European languages (Anechitei et al. 2013).

⁴ http://www.atlasproject.eu/

4 Annotation Conventions

Vivi Năstase, citing Levi (1978) and Séaghdha and Copestake (2008) in her book (Năstase et al. 2013), enumerates a set of principles for relation inventories:
• the inventory of relations should have good coverage;
• relations should be disjoint, and should describe a coherent concept;
• the class distribution should not be overly skewed or sparse;
• the concepts underlying the relations should generalize to other linguistic phenomena;
• the guidelines should make the annotation process as simple as possible;
• the categories should provide useful semantic information.

In this section we present a set of annotation conventions that observe the above principles and were put at
the basis of the "Quo Vadis" corpus.

4.1 Layers of Annotation

The Romanian automatic pre-processing chain applied on the raw texts of the book consists of the following tasks, executed in sequence:
• segmentation at sentence level (marks the sentence boundaries in the raw book text);
• tokenization (demarcates words or word compounds, but also numbers, punctuation marks, abbreviations, etc.);
• part-of-speech tagging (identifies POS categories and morpho-syntactic information of tokens);
• lemmatization (determines lemmas of words);
• noun phrase chunking (explores the previously generated data and adds information regarding noun phrase boundaries and their head words) (Simionescu 2012).

Let us note that we have not found a standard for annotating entities and relations. Năstase et al. (2013) say on this issue: "A review of the literature has shown that almost every new attempt to analyze relations between nominals leads to a new list of relations. We observe that a necessary and sufficient list of relations to describe the connections between nominals does not exist". As such, we went on with our own suggestions, knowing well that, at any moment in the future, if a need to adopt a standard arises, an automatic conversion will be possible.

4.2 Annotating Entities

Let us note that our intention is to put in evidence entities as they are mentioned in a piece of literature. These are characters or groups that play different roles in the development of the story. A human reader usually builds a mental representation for each of them the very moment those characters (or groups) are first mentioned, and these representations are recalled from memory any time they are evoked subsequently. The mental representation associated with a character may change while the story unfolds, although a certain mapping remains constant. It is just as if we associate a box or a container with
each character and afterwards we fill it with details (name, sex, kinship connections, composition, beliefs, religion, etc.). Some of these details may change as the story goes on, only the container remains the same. Any mention of that character is a mapping from a text expression to the corresponding container. In text, we annotate mentions, not containers, but recreate them after processing the coreference mappings, as will become clear in the next sections. So, what we call entities are these containers, or empty representation structures, serving as holders with which text mentions are associated. However, as will be shown later in Sect. 4.3, the notation we use for mentions of entities is an XML element also called ENTITY.

We concentrate only on entities of type PERSON and GOD, but groups of persons are also annotated as PERSON-GROUP, occupations or typologies are coded as PERSON-CLASS, human anatomical parts are coded PERSON-PART and names of persons are coded PERSON-NAME. Similarly, there will be GOD-GROUP, GOD-CLASS, GOD-PART and GOD-NAME, although very few of these last types, if any, really occurred. It is well known that an isomorphism exists between the world of humans and that of gods. In Greek and then in Roman antiquity, the same types of relations co-exist among gods as those holding among humans. The man Christ became a god in the Christian religion. As such, to talk about men and women and to neglect gods was not an option, because it would have created an artificial barrier.

Syntactically, the text realisations of entities are nominal phrases (NPs). Using a term common in works dedicated to anaphora, we will also call them referential expressions (REs), because the role of these expressions is to recall from memory (or refer to) the respective containers (where the
features add on). We will use the term NP when we discuss syntactic properties of the expression and RE when we discuss text coverage and semantic properties. It will be normal, therefore, to say that an NP has a syntactic head and an RE mentions (or represents, or refers to) an entity. A noun phrase normally has a nominal or pronominal head and can include modifiers: adjectives, numerals, determiners, genitival particles, and even prepositional phrases. Some examples are⁵: [Ligia], [Marcus Vinicius], [împaratul] ([the emperor]), [al lui Petronius] ([of Petronius]), [el] ([he]), [imperiul Roman] ([the Roman empire]), [un grup mare de credinciosi] ([a big group of believers]). There is one exception to this rule: nominal phrases realised by pronominal adjectives, as [nostru] (our) or [ale noastre] (ours). We do not include relative clauses (relative pronouns prefixing a verb and, possibly, other syntactic constituents) in the notation of entities. A relative pronoun is marked as an individual entity. Example: [Petronius], [care] era… ([Petronius], [who] was…).

⁵ In all examples of this chapter we will notate occurrences of entities between square brackets, and we will prefix them with numbers to distinguish among them where their identities are important.

The reflexive pronouns in the reflexive forms of verbs are also not marked, as in: ei se spală (they REFL-PRON wash); but other reflexive pronouns not appearing in a verbal compound are marked: siesi, sine (herself, himself), etc.

An NP may textually include another NP. We will say they are “imbricated” and, by abuse of language, sometimes we will say the corresponding entities are also “imbricated”. It should be noted that imbricated NPs always have separate heads and always represent distinct entities. NP heads would therefore be sufficient to represent entities. Still, because we want our corpus to be useful also for training NP chunkers, as REs we always notate the whole NP constructions, not only their heads. When several imbricated NPs have the same head, only the longest is annotated as an entity. Example: [alte femei din [societatea înaltă]] ([other women of [the high society]]), and not [alte [femei] din [societatea înaltă]] or [alte [femei din [societatea înaltă]]], because [femei], [femei din societatea înaltă] and [alte femei din societatea înaltă] all have the same head: “femei”. Another syntactic constraint imposes that no two NPs intersect without being imbricated.

We have instructed our annotators to try to distinguish identification descriptions from characterisation descriptions. For instance, in acest bărbat, stricat până-n măduva oaselor (this man, rotted to the core of his bones), stricat până-n măduva oaselor (rotted to the core of his bones) is a characterisation description. It does not help in the identification of a certain man among many and should be neglected in the notation of an RE. Alternatively, in convoi de fecioare (a band of maidens), the sequence de fecioare (of maidens) is an identification description, because it uniquely identifies “the band” among many others, and it should be included in the RE aimed to refer to that band as an entity group. Only identification descriptions will be marked as REs.

4.3 Annotating Relations

One class of anaphoric relations and three classes of non-anaphoric relations are scrutinised, each with sub-types. We present annotation conventions and methodological prerequisites based on which a corpus that puts in evidence characters and the relations mentioned as holding between them has been manually built.

As will be seen in the following sub-sections, each relation holds between two arguments, which we will call poles, and, with
one exception, is signalled by a word or an expression, which we will call trigger. In general, when marking relations we want to evidence the minimal span of text in which a reader deciphers a relation. Except for coreferential relations, in which poles can sometimes be quite distant in text and there is nothing to be used as a trigger, relations are usually expressed locally in text, within a sentence, within a clause, or even within a noun phrase. As such, except for coreferentiality, each relation span should cover the two poles and the trigger.

Our notations are expressed in XML. Basic layers of annotation include: borders of each sentence (marked as <S></S> elements, and identified by unique IDs) and words (marked as <W></W> and including unique IDs, lemma and morpho-syntactic information). Above these basic layers the annotators marked three types of XML elements:
• ENTITY: delimiting REs, including the attributes ID, TYPE and, optionally, HEAD; as will be explained below, for included subjects (pronoun-dropping) the verb is annotated instead as an ENTITY;
• TRIGGER: marking relations' triggers; it delimits a word (<W>) or an expression (sequence of <W>);
• REFERENTIAL, AFFECT, KINSHIP and SOCIAL: mark relations. With the exception of coreferential relations, these markings delimit the minimal spans of text that put in evidence these types of relations. Their attributes are: a unique ID, the sub-type of the relation (listed below), the two poles and the direction of the relation (the attributes FROM and TO), and the ID of a trigger (the attribute TRIGGER).

The two poles of a relation could be intersectable or not. If they are intersectable, then they are necessarily nested and the direction convention is to consider the FROM entity the larger one
and the TO entity the nested one.
1:[celui de-al doilea <sot> ] (1:[to the second husband 2:[of Popeea]]) ⇒ [1] spouse-of [2]⁶

⁶ To save space, in the notations showing relations in our examples, we will mark the entities in labelled square brackets, as before, and the triggers in pointed brackets; the relations themselves are indicated by their sub-types; sometimes, when there is a need to associate triggers with their corresponding relations, these are also labelled.

As already mentioned, the coreferential relation could never be expressed between nested REs. If the RE poles are not nested, we adopted a right-to-left direction in annotating coreferential relations. This decision will be defended in the next sub-section. For the rest of the relations the convention is to consider the direction as indicated naturally by reading the trigger and its context. For instance, in the text “X loves Y”, the relation love, announced by the trigger loves, is naturally read as [X] love [Y], therefore with FROM=X and TO=Y, but in “X is loved by Y”, the relation will be [X] loved-by [Y].

It could happen that a pole of a relation is not explicitly mentioned. This happens in cases of included subjects, when the subjects are expressed by null (or dropped) pronouns. In Romanian, the morphological properties of the subject are included in the predicate, such that the missing pole will be identified with the predicate.
Example: dar (1:[îl] si 2:[<iubeau>, REALISATION="INCLUDED"]) din tot sufletul (2:[<loved> REALISATION="INCLUDED"] 1:[him] with the whole soul) ⇒ loves

It should be noted that a word could be simultaneously marked as a token (<W>), trigger (<TRIGGER>) and entity (<ENTITY>). For instance, iubeau (love-PAST-TENSE) in the notation below has all three markings. The value of the FROM attribute of the AFFECT element will be filled in by the ID of the verb iubeau, marked as an ENTITY, while the value of the TRIGGER attribute in the same relation will be the ID of the TRIGGER element covering the same word.

When a relation is expressed through the eyes of another character, being perceived particularly as such by this one, or is still uncertain or to be realised in the future, we say that the relation is “interpreted”. As such, a relation R will be marked as R-interpret. All types and sub-types of relations could have interpret-ed correspondents.

4.4 Referential Relations

When the understanding of one RE-to-entity mapping depends on the recuperation from memory of a previously mentioned entity (the container together with its accumulated content), we say that a referential relation occurs between this RE (called anaphor) and that entity (called antecedent). In the literature, this definition presents variants, some (as Mitkov 2003) insisting on the textual realisation of the relation, the anaphor and the antecedent being both textual mentions, while others (Cristea and Dima 2001) put in evidence its cognitive or semantic aspects; as such, the anaphor is a textual mention and the antecedent is an entity as represented on a cognitive layer, therefore, in our terms, a container plus its content. In addition, some authors also make the distinction between anaphora (when a less informative mention of an entity succeeds a more informative one; for instance, a pronoun follows a proper noun) and cataphora (when the other way round is true; for instance, the pronoun mention comes before the proper noun mention). It should be noticed however, as Tanaka (1999) and others have observed, that cataphora could be absolute (when the text includes no more informative reference to the entity before the less informative one) or relative (when the inversion takes place at the level of a sentence
only, a more informative mention being present in a sentence that precedes the one the pronoun belongs to).

In order to mark the direction of a referential relation, for non-imbricated REs, in connection with text unfolding (a more recent RE mentions an entity introduced or referred to by a previously mentioned RE), the poles of the coreferential relations are annotated as follows: the source (FROM) is the more recent one and the destination (TO) is the older one. In Romanian,⁷ this direction is from the right of the text to its left. For example, in pe 1:[Ligia] 2:[o] iubesc (1:[Ligia] is 2:[the one] I love), the relation is marked from [2] to [1]. Although perhaps less intuitive, the direction of cataphoric relations complies with the same right-to-left annotation convention, on the ground that a container (possibly including only scarce details) must have been introduced even by a less informative mention, while the more informative mention, coming after, refers back in memory to this conceptualisation and injects more information there (Cristea and Dima 2001). In the text 1:[îl] chemă pe 2:[Seneca] (he 1:[him clitic] summoned 2:[Seneca]) the direction is also from [2] to [1]. Deciphering anaphoric relations in the Romanian language is perhaps more complex than in other languages, mainly due to the duplication of the direct and indirect complements by unaccented forms of pronouns. But we will refrain from making anaphora resolution comments in the present study, as this topic is outside our declared intent.

⁷ Contrary, for instance, to Semitic languages.

We established nine sub-types of referential relations, listed and exemplified below.
• coref: by slightly modifying the definition given above for referentiality, we say that we have a coreferential relation between an RE and an entity E when we understand the RE-to-E identity mapping based on the recuperation from memory of E, a previously mentioned entity. Coref is a symmetric relation, where poles could be of types PERSON, PERSON-GROUP, GOD and GOD-GROUP, but always with both poles of the same category. It is important to notice that a coref relation can never occur between imbricated REs. Examples:
1:[Marcus Vinicius]… 2:[el]… (1:[Marcus Vinicius]… 2:[he]…) ⇒ coref;
1:[Ligia]… 2:[tânara libertă]… (1:[Ligia]… 2:[the young libert]…) ⇒ coref;
Nu avea nici cea mai mică îndoială că 1:[lucrătorul acela] e 2:[Ursus]. (He had not the least doubt that 1:[that laborer] was 2:[Ursus].) ⇒ coref-interpret;
L-am prezentat pe 1:[acest Glaucus] ca pe 2:[fiul Iudei] si 3:[trădător al tuturor crestinilor]. (I described 1:[Glaucus] as 2:[a real son of Judas], and 3:[a traitor to all Christians].) ⇒ coref-interpret, coref-interpret;
• member-of (a PERSON type RE is a member-of a PERSON-GROUP entity and, similarly, a GOD is a member-of a GOD-GROUP), a directed relation. Example:
1:[o femeie din 2:[societatea înaltă]] (1:[a woman of 2:[the high society]]) ⇒ member-of;
• has-as-member (the inverse of member-of, from a PERSON-GROUP to a PERSON, or from a GOD-GROUP to a GOD), directed:
1:[Petronius]… 2:[amândurora] (1:[Petronius]… to 2:[both of them]) ⇒ has-as-member;
1:[Ursus]… 2:[Ligia]… 3:[voi] (1:[Ursus]… 2:[Ligia]… 3:[you PL]) ⇒ has-as-member; has-as-member;
• isa (from a PERSON type RE to its corresponding PERSON-CLASS, or from a GOD to its corresponding GOD-CLASS), directed:
1:[nasă] să-mi fie 2:[Pomponia] (and I wish 2:[Pomponia] to be 1:[my godmother]) ⇒ isa only in the Romanian version; in the English version the two REs are inverted, which
gives here the inverse relation (see next);
• class-of (the inverse of isa, from a PERSON-CLASS to an instance of it of type PERSON, or from a GOD-CLASS to a GOD type instance), directed:
Dar nu esti 1:[tu] 2:[un zeu]? (But are 1:[thou] not 2:[a god]?) ⇒ class-of-interpret ([1] is seen as a god by someone);
dati-mi-1:[o] de 2:[nevastă] (Give 1:[her] to me as 2:[wife]) ⇒ class-of-interpret⁸;
Se trezise în 1:[el] 2:[artistul], 3:[adoratorul frumusetii]. (2:[The artist] was roused in 1:[him], and 3:[the worshipper of beauty].) ⇒ for the Romanian version: class-of-interpret; class-of-interpret; for the English version, because of the inversion: isa; class-of-interpret;
⁸ [1] could become a wife of the speaker but is actually not.
• part-of (an RE of type PERSON-PART is a part of the body of an entity of type PERSON, or a GOD-PART is a part of the body of an entity of type GOD), directed:
1:[mâna 2:[lui] dreaptă] (1: right hand]) ⇒ part-of;
• has-as-part (the inverse of part-of: a PERSON type RE has as a component part a PERSON-PART entity, or a GOD type RE has as a component part a GOD-PART entity), directed:
chinurile, 1:[sângele] si moartea 2:[Mântuitorului] (the torment, 1:[the blood] and the death 2:[of the Saviour]) ⇒ has-as-part;
• subgroup-of (from a subgroup, i.e. a PERSON-GROUP type RE, to a larger group, i.e. also a PERSON-GROUP type entity which includes it, and similarly for GOD-GROUP poles), directed:
1:[a], 2:[b], 3:[c] si 4:[alte femei din 5:[societatea înaltă]] (1:[a], 2:[b], 3:[c] and 4:[other women of 5:[the high society]]) ⇒ has-as-member, has-as-member, has-as-member, subgroup-of;
Christos 1:[i]-a iertat si pe 2:[evreii] care i-au dus la moarte si pe 3:[soldatii romani] care l-au tintuit pe cruce (Christ forgave 2:[the Jews] who
delivered him to death, and 3:[the Roman soldiers] who nailed him to the cross) ⇒ for the Romanian version only: subgroup-of; subgroup-of. The subgroup-of relation holds in the Romanian version because of the existence of the anticipating pronoun 1:[i], which signifies both groups. In the English equivalent no such mention appears and a subgroup-of relation cannot be formulated;
• name-of (inverse of has-name, linking a PERSON-NAME RE to a PERSON entity), directed:
1:[numele lui 2:[Aulus]] (1:[the name of 2:[Aulus]]) ⇒ name-of;
Petronius… care simtea ca pe statuia 1:[acestei fete] s-ar putea scrie: 2:[“Primavara”] (Petronius… who felt that beneath a statue of 1:[that maiden] one might write 2:[“Spring”]) ⇒ name-of-interpret (-interpret because Petronius is the one that gives this name).

4.5 Kinship Relations

Kinship (or family) relations (marked KINSHIP as XML elements) occur between PERSON, PERSON-GROUP, GOD and GOD-GROUP type of REs and entities. Seven subtypes have been identified, detailed below:
• parent-of (the relation between a parent or both parents and a child or more children; an RE A is in a parent-of relation with B if A is a parent of B, i.e. mother, father, both or unspecified), directed:
1:[<tatăl> ] (1: <father>]) ⇒ parent-of;
• child-of (inverse of parent-of; an RE A is a child-of B if the text presents A as a child or as children of B), directed:
1:[Ligia mea] este <fiica> (1:[My Lygia] is the <daughter> ) ⇒ child-of;
1:[<copilul> drag al 2:[celebrului Aulus]] (1:[a dear <child>] 2:[of the famous Aulus]) ⇒ child-of;
• sibling-of (between brothers and sisters), symmetric:
1:[sora lui 2:[Petronius]] (1: <sister>]) ⇒ sibling-of;
1:[niste <frat> ai 2:[tăi]] (1:[some of 2:[your] <brothers>]) ⇒ sibling-of;
• nephew-of (we have A nephew-of B, if A is a nephew/niece of B),
directed:
1:[scumpii 2:[săi] <nepot>] (1: dear <nephews>]) ⇒ nephew-of;
• spouse-of (symmetric relation between spouses):
…cu 1:[care] mai târziu 2:[Nero], pe jumatate nebun, avea să se <cunune> (…to 1:[whom] later 2:[the half-insane Nero] commanded the flamens to <marry> him) ⇒ spouse-of;
1:[Vinicius] ar putea să 2:[te] ia de <nevastă> (1:[Vinicius] might <marry> ) ⇒ spouse-of-interpret;
• concubine-of (symmetric relation between concubines):
1:[<concubina> ] (1: <concubine>]) ⇒ concubine-of;
• unknown (a kinship relation of an unspecified type):
1:[o <rudă> de-a 2:[lui Petronius]] (1:[a <relative> of 2:[Petronius]]) ⇒ unknown;
1:[<strămosilor> ] (1: <ancestors>]) ⇒ unknown.

4.6 Affective Relations

Affective relations (marked as AFFECT elements in our XML notations) are non-anaphoric relations that occur between REs and entities of type PERSON, PERSON-GROUP, GOD and GOD-GROUP. There are eleven subtypes, as detailed below:
• friend-of (A is a friend-of B, if the text expresses that A and B are friends), symmetric:
1:[<tovarăsii> ] (1: <comrades>]) ⇒ friend-of;
1:[Vinicius] e un nobil puternic, spuse el, si <prieten> cu 2:[împăratul]. (1:[Vinicius] is a powerful lord, said he, and a <friend> of 2:[Cæsar].) ⇒ friend-of-interpret;
• fear-of (A is in a relation fear-of with B if the text expresses that A feels fear of B), directional:
1:[oamenii] <se tem> mai mult de 2:[Vesta] (1:[people] <fear> more) ⇒ fear-of;
1:[Senatorii] se duceau la 2:[Palatin], <tremurând de frică> (1:[Senators], <trembling in their souls>, went to the 2:[Palatine]) ⇒ fear-of;
• fear-to (inverse of fear-of: A is in a relation fear-to B if the text expresses that the RE A inspires fear to the entity B), directional:
1:[Nero] îi <alarma> chiar si pe 2:[cei mai apropiat]
(1:[Nero] did <roused attention>, even in 2:[those nearest]) ⇒ fear-to;
• love (A is in a relation love to B, if A loves B), directional:
1:[Ligia] simti că o mare greutate i s-a luat de pe inimă. <Dorul> acela fără margini după 2:[Pomponia] (1:[She]⁹ felt less alone. That measureless <yearning> for 2:[Pomponia]) ⇒ love;
<îndrăgostit> ca 1:[Troilus] de 2:[Cresida] (<in love>, as was 1:[Troilus] with 2:[Cressida]) ⇒ love;
⁹ In the English equivalent, the mention of Ligia [1] is missing.
• loved-by (inverse of love: A loved-by B, if A is loved by B):
<iubită> este 1:[Ligia] de 2:[familia lui Plautius] (<dear> was to 2:[Plautius]) ⇒ loved-by;
• rec-love (A rec-love B if the text mentions a mutual love between A and B), symmetric:
<iubită> de 2:[altul] (in <love> with 1:[each] 2:[other]) ⇒ rec-love;
• hate (A hate B, if the text mentions that A hates B), directional:
Pe 1:[Vinicius] îl cuprinse o <mânie> năprasnică si împotriva 2:[împăratului] si împotriva 3:[Acteii]. (1:[Vinicius] was carried away by sudden <anger> at 2:[Cæsar] and at 3:[Acte].) ⇒ hate, hate;
• hated-by (A hated-by B, if A is hated by B), directional:
<ura> pe care 1:[i]-o purta 2:[prefectul pretorienilor] (<hatred> toward of 2:[the all-powerful pretorian prefect]) ⇒ hated-by;
• upset-on (A upset-on B, if the text tells that A feels upset, disgust, anger, discontent, etc. on B), directional:
1:[<Dispretuia> REALISATION="INCLUDED"] 2:[multimea] (1:[He] had a twofold <contempt> for ) ⇒ upset-on;
• worship (A worship B, if the text mentions that A worships B), directional:
1:[oamenii aceia] nu numai că-si <slăveau> (1:[those people] not merely <honored> ) ⇒ worship;
1:[Ligia] îngenunche ca să se <roage> (But 1:[Lygia] dropped on her knees to <implore> ) ⇒ worship;
• worshiped-by (A worshiped-by B if the text mentions
that A is worshiped by B), directional:
1:[un zeu cu totul neînsemnat] dacă n-are decât 2:[două <adoratoare>] (1:[a very weak god], since he has had only 2:[two <adherents>]) ⇒ worshiped-by.

4.7 Social Relations

The group of social relations (marked SOCIAL in our XML annotations) are non-anaphoric relations occurring only between PERSON or PERSON-GROUP REs and entities. They are grouped in six subtypes, as detailed below:
• superior-of (A superior-of B, if A is hierarchically above B), directional:
<Eliberând>-1:[o], 2:[Nero] (2:[Nero], when he had <freed> ) ⇒ superior-of;
1:[Nero] a ordonat <predarea> (1:[Nero] demanded 2:[my] <surrender>) ⇒ superior-of;
1:[un centurion] <în fruntea> (1:[a centurion] <at the head> ) ⇒ superior-of;
• inferior-of (inverse of superior-of, A inferior-of B if A is hierarchically subordinated to B), directional:
1:[<consul> pe vremea 2:[lui Tiberiu]] (1:[a man of <consular> dignity from the time 2:[of Tiberius]]) ⇒ inferior-of;
1:[Tânărul] luptase <sub comanda> (1:[The young man] was serving then <under> ) ⇒ inferior-of;
1:[<libertei> ] (1: <freedwoman>]) ⇒ inferior-of;
• colleague-of (A colleague-of B if the text explicitly places A on the same hierarchical level with B), symmetrical:
1:[<tovarăsii> ] (1: <companions>]) ⇒ colleague-of;
• opposite-to (A opposite-to B, if A is presented in a position that makes her/him oppose B), directional:
Să nici nu-ti treacă prin gând să 1:[te] <împotrivesti> (Do not even 1:[think; REALISATION="INCLUDED"] of <opposing> ) ⇒ opposite-to-interpret;
1:[Pomponia si Ligia] otrăvesc fântânile, <ucid> (1:[Pomponia and Lygia] poison wells, <murder> ) ⇒ opposite-to;
• in-cooperation-with (A is in-cooperation-with B if the text presents A as performing something together with B), directional:
1:[Vannius] a chemat T1:<în ajutor> pe 2:[iagizi], iar 3:[scumpii săi nepot] pe 4:[ligieni] (1:[Vannius] summoned to his T1:<aid> 2:[the Yazygi]; 3:[his dear nephews] T2:<called in> ) ⇒ in-cooperation-with, trigger: <T1>; in-cooperation-with, trigger: <T2>;
• in-competition-with (A is in-competition-with B, if A is presented as being in a sort of competition with B), directional:
1:[Petronius] 2:[îl] <întrecea> cu mult prin maniere, inteligentă (1:[Petronius] <surpassed> infinitely in polish, intellect, wit) ⇒ in-competition-with.

4.8 Examples of Combinations of Relations

At the end of this section we give a few examples showing complex combinations of relations.
1:[Vinicius]… e 2:[o <rudă> de-a 3:[lui Petronius]] (1:[Vinicius]… is 2:[a <relative> ]) ⇒ coref-of, KINSHIP:unknown;
Se repezi la 1:[Petru] si, luându-2:[i] 3:[mâinile], începu să 4:[i] 5:[le] sărute (… seized 3:[the hand of 1:[the old Galilean]], and pressed 5:[it] in gratitude to his lips.)¹⁰ ⇒ coref; part-of (or ); coref (or ); coref. It is superfluous to mark as part-of because it results by transitivity from it being coreferential with and being part-of.
¹⁰ In the English equivalent, two mentions of Peter ([2] and [4]) are missing.
1:[Vinicius] si 2:[<tovarasii> ] (1:[Vinicius] and 2: <comrades>]) ⇒ coref; SOCIAL:colleague-of.

5 Creating the Corpus

The realisation of a manually annotated corpus incorporating semantic relations requires a fine-grained interpretation of the text. This brings the danger of non-homogeneity, due to idiosyncrasies of views over the linguistic phenomena under investigation held by different annotators, each working on different parts of the document. A correlation activity that would level divergent views is compulsory.
Let us add to this that many details of the annotation conventions usually settle down in iterative sessions of discussions within the group of annotators, following the rich casuistry picked up along the first phases of the annotation process. As such, the work should be organised in such a way as to ensure that the result of the annotation process contains the fewest errors possible, and that the conventions are coherently applied over the whole document.

5.1 Organising the Activity

The annotation activity of the “Quo Vadis” corpus was performed over a period of three terms with students in Computational Linguistics.¹¹ An Annotation Manual, including an initial set of annotation rules, was proposed by the first author to the students at the beginning of the activity and discussed with them. Then, the students went through some practical classes in which they were taught to use an annotation tool. Approximately half of the novel was split into equal slices and distributed to them, and they began to work independently or in groups of two. During the first term, in weekly meetings, annotation details were discussed, difficult cases were presented and, based on them, the Manual was refined. At the end of the first term their activity was individually evaluated and the results showed that only about 15 % of them were trustworthy enough to be given full responsibility.¹² As a by-product, we had, at the time, a consistent set of annotation rules, and PALinkA,¹³ our annotation tool, could incorporate rather stable preference settings (describing the XML structural constraints).

¹¹ A master's programme organised at the “Alexandru Ioan Cuza” University of Iasi by the Faculty of Computer Science, which accommodates graduate students with a background in either Computer Science or Humanities.
¹² It was not a surprise that, for annotation activities, the most dedicated and skilful students were those having a Humanities background.
¹³ PALinkA was created by Constantin Orăsan in the Research Group in Computational Linguistics, at the School of Law, Social Sciences and Communications, Wolverhampton. PALinkA was used for annotating corpora in a number of projects, for purposes including: anaphoric and coreferential links in a parallel French-English corpus, summarisation, different versions of the Centering Theory, coreferences in email messages and web pages, or for Romanian named entities.

We continued the activity during the next two terms with only the best ranked members of the former class (among them, a Ph.D. researcher with a Philology background). At the beginning of the next year (September 2013), a few new students with a Philological background went through a rapid training period and joined the team (among them, two Ph.D. researchers in Humanities). The quality improved a lot, but to the detriment of the speed, which continued to be very slow. At that moment it became clear to us that it would be impossible to achieve this ambitious task by going through the text three times or even only twice, as the usual norms for redundant annotation require in order to organise a proper inter-annotator agreement process. As a consequence, we started to think of other methods for obtaining accuracy that would involve only one manual annotation pass. We imagined different methods for clearing the corpus of errors, which will be detailed in the following sub-sections.

As shown already, the files opened in PALinkA had been previously annotated in XML with markings signalling word (<W>) and paragraph (<P>) boundaries. Over these initial elements the annotators have marked: entities, coreferential links, triggers of relations, and relation spans, including attributes indicating
the poles and the triggers.

In building the manual annotation, where words are ambiguous, we have instructed our annotators to use their human capacity of interpretation in order to decide the true meaning of words, the types of relations or the entities that are glued by relations.¹⁴ For instance, words and expressions, based on their local sense, could function as triggers only in some contexts (father, for instance, should not be taken as signalling a parent-of relation if its meaning is that of priest).

¹⁴ Not rare were cases when philologists asked: And how would the machine recognise this relation when it was difficult even for me to decipher it here?!…

5.2 Acquiring Completeness and Correctness

Along the whole process of building the “Quo Vadis” corpus, the two main preoccupations were: to acquire completeness (therefore to leave behind as few unmarked entities or relations as possible) and to enhance its quality (therefore to clean the corpus of possible errors). As said already, in order to distribute the text to different annotators, we split the text of the novel into chunks of relatively equal size (phase 1 in Fig. 1). This resulted in 58 chunks, each including on average approximately 123 sentences. The following formula was used to estimate the density of annotations (D) of each chunk:

\[ D = \frac{E + 2R + 5(A + K + S)}{N} \]

where: E = number of marked entities; R = number of marked REFERENTIAL relations; A, K, S = number of marked AFFECT, KINSHIP and SOCIAL relations; N = number of sentences.

Fig. 1 Annotation-correction-sewing-merging cycles in the building of the “Quo Vadis” corpus

During the annotation process, the density scores per segment varied from 0 to more than 20. Assuming an approximately uniform density all over the novel,¹⁵ these
scores allowed us to detect at a glance those chunks which received too little attention from the annotators, and also to spot the most diligent annotators. After the first round, only the best ranked annotators were retained in the team. In the second round, all chunks scored low, therefore contributed by dismissed students, were resubmitted for a second annotation round to the selected members who remained in the refreshed team (4). At this moment, all chunks are scored over 5.5, the maximum reaching 20.2 and the whole novel having an average density score of 9.4. But this score does not reflect the correctness.

¹⁵ Not necessarily true, because long passages of static descriptions are bare of mentions of entities and, consequently, relations.

The final step in the construction of the corpus was dedicated to enhancing the accuracy. As said, because of the very high complexity of the task, which makes it extremely time-consuming, and the scarcity of skilled people able to do an expert annotation task, it has not been possible to organise an inter-annotator agreement process. However, several measures to enhance correctness were taken.

In phase 2 (Correction in Fig. 1), the best trained annotators of the team received the additional task of error-proofing the annotations of their last year's colleagues, updating them to the new standards and unifying them with the new ones. Then, in the 3rd phase (Sewing in Fig. 1) the cross-chunk-border coreferential links were notated, as pairs of ENTITY IDs. These lists were then passed to the 4th phase (Merging in Fig. 1), in which the chunks of annotated text were physically merged into just one file and REFERENTIAL XML elements with TYPE="coref" were added at the end of the document for all cross-border coreferential
pairs. In this phase, error-detection filters were also run. These filters are described in Sect. 5.3. The errors signalled by the filters were passed back to the annotators, who repaired them in the original files. Lists of coreferential entity names were also produced, and these gave important clues for noticing coreference errors. For instance, it is impossible for an instance of Ligia to appear in the same chain as Nero, and very unlikely that a plural pronoun would ever refer to a single character. Moreover, chains representing group characters, if they contain pronouns, should include only plural pronouns.

5.3 Error Correcting Filters

We list in this section a number of filtering procedures that helped to detect annotator errors.

• We call a coreference chain (CC) a list of REs whose occurrences are sequentially ordered in the text and which all represent the same PERSON/GOD entity or the same PERSON-GROUP/GOD-GROUP entity16 ⇒ any proper noun that appears in a CC should be a variation of the name of that entity. We extracted one occurrence of every proper name in the CCs and manually verified whether they are variations, inflections or nicknames of the name of the same character (e.g. Marcus, Vinicius and Marcus Vinicius for the character [Vinicius], or Ligia, Ligiei, Callina, Callinei for the character [Ligia], or Nero, Barbă-Arămie, Ahenobarbus, Cezar, Cezarul, Cezarului, etc. for the character [Nero]);
• All common nouns and pronouns in a CC generally have the same number+gender values17 ⇒ For each W having the category common noun or pronoun in a CC, we extracted the number+gender value pairs and reported the cases in which they are not identical;
• It is improbable that an entity is referred to only by pronouns ⇒ We listed the CCs that include only pronouns and passed them to the correctors for a second look;
• In most cases, gods are referred to by names written with capital letters ⇒ We reported the exceptions;
• There should be a one-to-one mapping between triggers and relations ⇒ We reported any trigger that is not referred to in any relation or that is used by more than one relation;
• Triggers cannot appear as values of the FROM and TO attributes, and no element type other than TRIGGER can be the value of the TRIGGER argument of a relation ⇒ We performed argument type checking (see Năstase et al. 2013, citing Pasca et al. 2006 and Rosenfeld and Feldman 2007) for “matching the entity type of a relation’s argument with the expected type in order to filter erroneous candidates”. Combined (or coupled) constraints, as proposed by Carlson et al. (2010) for semi-supervised learning of relations in context, were not of primary interest at the moment of building the corpus;
• In the vast majority of cases, the two poles and the trigger belong to the same sentence. For instance, in the example 1:[e;REALISATION="INCLUDED"] 2:[-un patrician], 3:[prieten cu 4:[împăratul]] (1:[he] is 2:[a patrician], 3:[a friend 4:[of Caesar]]), the correct annotation is as follows: class-of; class-of; friend-of. As such, the friend-of relation does not cross the borders of the second sentence ⇒ We reported non-coreferential relation spans that cross sentence borders and asked the correctors to verify them.

16 Note that the REFERENTIAL:coref links should separate the whole class of ENTITY elements into disjoint trees. Trees, and not general graphs, because, considering ENTITYs as nodes in the graph and REFERENTIAL:coref relations as edges, there is just one TO value (parent in the graph) for each ENTITY node.
17 There are exceptions to this rule: a plural may be referred to by a singular noun denoting a group, or the mismatch may be due to errors of the POS-tagger, etc.

6 Statistics Over the Corpus

In this section we present a number of
statistics and comment on the semantic links of the corpus from a global perspective. Table 1 presents the corpus in numbers. It can be seen that 20 % of the tokens of the novel are covered by some manual annotation (entity, trigger, relation). The vast majority of relations belong to the REFERENTIAL type. A comparison is shown in the diagram of Fig. 2.

If the 17,916 REFERENTIAL:coref and REFERENTIAL:coref-interpret relations (the most numerous) are left aside, the distribution is the one depicted in Fig. 3. Fig. 4 shows the distribution of the different types of REFERENTIAL relations (without REFERENTIAL:coref and REFERENTIAL:coref-interpret).

Figures 5, 6 and 7 show the distributions of KINSHIP, SOCIAL and AFFECT relations in the corpus.

Fig. 2 Comparing families of relations

Table 1 The corpus at a glance
Counted elements                                                      Values
# sentences                                                           7,150
# tokens (W elements, punctuation included)                           144,068
# tokens (W elements, excluding punctuation)                          123,093
# tokens under at least one annotation (punctuation included)         28,851
# tokens under at least one relation (punctuation included)           7,520
# tokens summed up under all relations (punctuation included)         9,585
# entities                                                            22,310
# REF annotations (all)                                               21,439
# REF:coref and REF:coref-interpret annotations                       17,916
# AKS annotations                                                     1,133
# TRIGGER annotations                                                 1,097
total # annotations (ENTITY + TRIGGER + REF + AKS)                    45,979
overall density score                                                 10.21

Long relation spans make the discovery of relations difficult. The graph in Fig. 8 shows the ranges of span lengths for those REFERENTIAL relations whose lengths can be estimated; REFERENTIAL:coref and REFERENTIAL:coref-interpret are therefore not considered.

As can be seen, the average span for this group of relations lies somewhere around 20 words. Fig. 9 shows the same statistics for the other
three families of relations. A quick glance shows that KINSHIP relations are expressed over a shorter context than the other types. This is mainly because many KINSHIP relations are contextualised in noun-phrase expressions (his mother, the son of X, etc.).

To see how often long spans appear in the corpus compared with short spans, Fig. 10 shows the frequency of the different lengths of relation spans (of course, excluding REFERENTIAL:coref and REFERENTIAL:coref-interpret). Its steeply descending shape shows that short spans occur much more frequently than long spans. There is a peak at length 3, indicating the most frequent span. The longest relation covers 154 words (but there is only one of this length).18

Fig. 3 Comparing families of relations (without REFERENTIAL:coref and REFERENTIAL:coref-interpret)
Fig. 4 Distribution of REFERENTIAL relations (without REFERENTIAL:coref and REFERENTIAL:coref-interpret)
Fig. 5 Distribution of KINSHIP relations
Fig. 6 Distribution of SOCIAL relations
Fig. 7 Occurrences of AFFECT relations
Fig. 8 Span length ranges and averages in number of words for REFERENTIAL relations (excepting REFERENTIAL:coref and REFERENTIAL:coref-interpret)
Fig. 9 Span length ranges and averages for AKS relations
Fig. 10 Relation occurrences (Oy) in correlation with the relation span length (Ox)

If we denote by f(x) the
function in Fig. 10, the total number of words in the spans of the relations would be:

$$\sum_{x=2}^{154} x\, f(x) = 9{,}585$$

This corresponds to the total count of XML <W>…</W> markings under some relation span other than REFERENTIAL:coref and REFERENTIAL:coref-interpret. It represents approximately 6.66 % of the total area of the book, including punctuation (144,068 tokens).

18 In this version of the corpus we did not make a thorough verification of long relations. We have noticed some errors in the annotation of poles, especially when one of the two poles is a null pronoun in subject position and REALISATION="INCLUDED" has not been marked on the respective verb. In reality, a long-distance coref relation would link the main verb (or its auxiliary) to a named entity, which now stands as one of the poles.

An analysis of this kind is interesting because it reveals, in approximate terms, the proportion between positive and negative examples in an attempt to decipher relations automatically, and it could be of help when designing the sample set in a statistical approach to training a recognition program from the corpus.

Another set of statistics addresses the triggers. We are interested in knowing to what degree triggers are ambiguous. The sparse matrix in Table 2 allows an analysis of this type. All relations are placed on both rows and columns, and the number in a cell (R1, R2) indicates how many triggers, as (sequences of) lemmas, are common to the relations R1 and R2.

The last set of graphical representations are semantic graphs. Figures 11, 12 and 13 show sets of affection, family and social relations for some of the most representative characters in the novel.

Nodes in these graphs represent entities. Each of them concentrates all coreferential links of one chain. Node names were
formed by choosing the longest proper noun in each chain. When a character is mentioned under several different names, a concatenation of them was used (as is the case with Ligia-Callina). For the chains (usually small) that do not include proper nouns, one of the common nouns was used. When, in doing so, several chains (nodes) got the same name, the names were manually edited by appending digits, after verifying that the corresponding chains are indeed distinct.

Table 2 The ambiguity of triggers

In Fig. 11, for instance, one can read a love relation from Nero towards Acte, his child, his entourage (society) and, accidentally, Ligia. There are also reciprocal love relations linking Vinicius and Ligia, while Petronius is loved by Eunice and loves Vinicius.

Fig. 11 A network of relations AFFECT:love

Figure 12 brings together both sets of relations, parent-of and child-of, by reversing the direction of the child-of relations. The (not too many) family relations expressed in the novel are now evident. Ligia-Callina has two fathers, the Phrygian king and Aulus Plautius.

Fig. 12 A network of relations KINSHIP:parent-of

Finally, Fig. 13 reveals the superior-of/inferior-of pair of links (also by reversing the direction of the inferior-of relations). Central in this graph is, as expected, the emperor Nero, socially superior to almost all characters in the novel. There is no edge pointing towards the node representing this character in the graph. Following him come Vinicius (revealed as being superior to the people of Rome, to the two slaves Demas and Croton, as well as to other servants, slaves and freed slaves) and Petronius (linked to his servants and slaves, to praetorians, but also to his beloved Eunice). As expected, there is no superiority relation between the two close friends Petronius and Vinicius. In his weaker relationship with the emperor Nero, Vinicius is not mentioned as his inferior, while Petronius, his cultural counsellor and praise-giver, is. The only superior Vinicius seems to have throughout the whole novel is Corbulon, a former military chief.

Fig. 13 A sub-network of relations SOCIAL:superior-of

As remarked in Năstase et al. (2013), “The importance of an entity in a semantic graph is determined not only by the number of relations the entity has, but also by the importance of the entities with which it is connected. So, for a character to be influential in the novel it is not enough to have many relations, but to be related with influential characters too. The PageRank (Brin and Page 1998) could be applied to measure the centrality/influence of an entity according to its position in the graph”. Such estimations are yet to be made in further research, but even a simple visual inspection of our graphs brings out the central characters: Vinicius, Petronius, Nero, Ligia. Note also that all these graphs display sets of homonymous relations only once. More sophisticated representations, showing also the number of relations, not only their types, could bring out the annotations in the corpus more clearly. Moreover, chains of social links could also reveal hierarchical positions in society (as, for instance, the one connecting the characters Nero, Ligia, a Ligian and Croton through superior-of relations). Combining graph relations could also reveal complex situations or plot developments, as for instance the distinction between a family type
of affection (between Ligia and Plautius’s wife, for instance, Plautius’s wife being a parent for Ligia) and that of lovers (the sentiment that Vinicius develops towards Ligia and vice versa, neither of these being doubled by any kinship relation).

The examples put forth are bits of complex interpretation. They reveal that the detection of semantic relations may involve complex reasoning steps, thus including seeds of a true understanding of the semantic content of a large coherent text.

7 Conclusions and Further Work

The research aims to formalize relationships between the characters of a novel, thus establishing precise criteria that underpin aspects of the interpretation of text. The annotations that we propose can be considered as representation bricks in a project pertaining to the interpretation of free texts. Since the world described in our corpus is a fictional one, free of any constraints, we believe that the representations of entities and their links made explicit in the annotation constitute pieces of knowledge that can be extended to other universes or world structures with minimal adaptation.

The enterprise of building the “Quo Vadis” corpus was an extremely time-consuming and difficult one, necessitating iterative refinements of the annotation conventions, followed by multiple correction, sewing and merging steps. We considered that describing this process could be interesting per se, as a model to apply in building other corpora when the scarcity of resources does not permit passing the manual work under the critical eyes of several annotators. Instead, an iterative improvement methodology was applied, by designing syntactic and semantic filters, running them and correcting the reported errors.

The corpus is still too fresh to risk a detailed numerical
report and interpretation on it. In the next few months it may still undergo further improvements.19 The graphics and tables we presented should, therefore, be interpreted more qualitatively than quantitatively, e.g. in terms of the ratios between different types of relations. They make it possible to sketch an impression of the density of person and god types of entities, and of the relations mentioned among them, in a literary free-style text. Of course, from genre to genre, style to style and document to document the densities and ratios may vary dramatically, but, we believe, proportions will remain within the same orders of magnitude.

The corpus is intended to serve as a basis for a number of investigations in the area of semantic links, mainly oriented towards their automatic identification. To refine the features used in training statistical relation recognition programs, other layers of annotation could be useful, the most evident one being the syntactic layer, for instance dependency links. Then, on top of the annotations already included, other types of entities and relations could be added. Examples of such refinements include the annotation of places and of relations between people and places, or between places and places. Such markings could bring out descriptions of journeys in travel guides, or geographical relations in high school textbooks. Of a different kind, and extensively studied (see Mani et al. 2006 for a survey), are the temporal relations.

Of some interest could be the issue of exporting annotations between parallel texts, for instance from the Romanian version of “Quo Vadis” to its English or Polish version. If this proves possible, then a lot of time and money could be saved.

In the process of deep understanding of texts, superior levels of interpretation could be placed on top of discovering inter-human or human-god relationships, as,
for instance, deciphering groups manifesting a distinctive, stable and cohesive social behaviour (as are, in the novel, the group of Romans and that of Christians). If time is added to the interpretation, then developments of stories could be traced as well. Sometimes individuals migrate from one group to the other (see Vinicius) and the range of sentiments and social relations might change (Vinicius to Christ: from lack of interest to worship, and Vinicius to Nero: from inferior-of to insubordination). On the other hand, a society as a whole can be characterised by its set of inter-individual relationships, and interesting contrasts could be determined. The new society of Christians, with their affective relations of love and worship, regenerates the old and decadent society of Romans, especially that cultivated at the court of the emperor. The inferior-of, fear and hate relations, frequent between slaves and their masters, are replaced by in-cooperation-with, friendship and love, characteristic of the Christian model.

19 One of the authors is preparing a dissertation thesis (due June 2014) on the theme of this corpus, and is responsible for its correctness and for complete statistics over it.

The corpus includes only explicitly evidenced relations (what the text says), but in many cases a human reader deduces relations on a second or deeper level of inference. Moreover, some explicitly stated relations are false or insincere, as for instance the declared love or worship sentiments of some underdogs with respect to the emperor. Deducing the falsity of relations could mean, for instance, recognising relations of an opposite type, stated in different contexts and towards different listeners by the same characters. All these could
be subjects of further investigation, but to do such complicated things one should start by doing simple things first, such as the automatic discovery of clearly stated relations like those annotated in the “Quo Vadis” corpus.

Acknowledgments We are grateful to the master students in Computational Linguistics from the “Alexandru Ioan Cuza” University of Iasi, Faculty of Computer Science, who, over three consecutive terms, annotated and then corrected large segments of the “Quo Vadis” corpus. Part of the work in the construction of this corpus was done in relation with COROLA, the Computational Representational Corpus of Contemporary Romanian, a joint project of the Institute for Computer Science in Iasi and the Research Institute for Artificial Intelligence in Bucharest, under the auspices of the Romanian Academy.

References

Anechitei, D., Cristea, D., Dimosthenis, I., Ignat, E., Karagiozov, D., Koeva, S., et al. (2013). Summarizing short texts through a discourse-centered approach in a multilingual context. In A. Neustein & J. A. Markowitz (Eds.), Where humans meet machines: Innovative solutions to knotty natural language problems. Heidelberg: Springer.
Bagga, A., & Baldwin, B. (1998). Entity-based cross-document coreferencing using the vector space model. Proceedings of COLING ’98, 1.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. Proceedings of IJCAI ’07.
Bejan, C. A., & Harabagiu, S. (2010). Unsupervised event coreference resolution with rich linguistic features. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1), 107–117.
Boschee, E., Weischedel, R., & Zamanian, A. (2005). Automatic information extraction. Proceedings of the 2005 International Conference on Intelligence Analysis, McLean, VA, pp. 2–4.
Bunescu, R. C., & Pasca, M. (2006). Using encyclopedic knowledge for named entity disambiguation. European Chapter of the Association for Computational Linguistics (EACL 2006).
Carlson, A., Betteridge, J., Wang, R. C., Hruschka Jr., E. R., & Mitchell, T. M. (2010). Coupled semi-supervised learning for information extraction. Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010).
Chen, B., Su, J., Pan, S. J., & Chew, L. T. (2011). A unified event coreference resolution by integrating multiple resolvers. Proceedings of the 5th International Joint Conference on Natural Language Processing, pp. 102–110, Chiang Mai, Thailand.
Cristea, D., & Dima, G. E. (2001). An integrating framework for anaphora resolution. Information Science and Technology, Romanian Academy Publishing House, Bucharest, 4(3–4), 273–291.
Cruse, D. A. (1986). Lexical semantics. Cambridge: Cambridge University Press.
Cucerzan, S. (2007). Large-scale named entity disambiguation based on Wikipedia data. Empirical Methods in Natural Language Processing (EMNLP).
Cybulska, A., & Vossen, P. (2012). Using semantic relations to solve event coreference in text. Proceedings of Semantic Relations-II. Enhancing Resources and Applications Workshop, Istanbul.
Del Gaudio, R. (2014). Automatic extraction of definitions. Ph.D. thesis, University of Lisbon.
Drabek, R., & Yarowsky, D. (2005). Induction of fine-grained part-of-speech taggers via classifier combination and crosslingual projection. Proceedings of the ACL Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond, June 29–30, 2005, Ann Arbor, Michigan, pp. 49–56.
Gala, N., Rey, V., & Zock, M. (2010). A
tool for linking stems and conceptual fragments to enhance word access. Proceedings of LREC-2010, Malta.
Girju, R., Badulescu, A., & Moldovan, D. (2006). Automatic discovery of part-whole relations. Computational Linguistics, 32(1), 83–135.
Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. Proceedings of COLING ’92.
Hjørland, B. (2007). Semantics and knowledge organization. Annual Review of Information Science and Technology, 41, 367–405.
Iida, R., Komachi, M., Inui, K., & Matsumoto, Y. (2007). Annotating a Japanese text corpus with predicate-argument and coreference relations. Proceedings of the Linguistic Annotation Workshop, pp. 132–139.
Kawahara, D., Kurohashi, S., & Hasida, K. (2002). Construction of a Japanese relevance-tagged corpus. Proceedings of LREC ’02.
Levi, J. N. (1978). The syntax and semantics of complex nominals. New York: Academic Press.
Lyons, J. (1977). Semantics. Cambridge: Cambridge University Press.
Malmkjær, K. (1995). Semantics. In K. Malmkjær (Ed.), The linguistics encyclopedia (pp. 389–398). London: Routledge.
Mani, I., Wellner, B., Verhagen, M., Lee, C. M., & Pustejovsky, J. (2006). Machine learning of temporal relations. Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, Australia.
Masatsugu, H., Kawahara, D., & Kurohashi, S. (2012). Building a diverse document leads corpus annotated with semantic relations. Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, pp. 535–544.
Mazlack, L. (2004). Granular causality speculations. IEEE Annual Meeting of the Fuzzy Information, 2004. Processing NAFIPS ’04, 690–695.
Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990). Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, 3(4) (winter 1990), 235–244.
Mitkov, R. (2003). Anaphora resolution. In R. Mitkov (Ed.), The Oxford handbook of computational linguistics (pp. 266–283). Oxford: Oxford University Press.
Mulkar-Mehta, R., Hobbs, J. R., & Hovy, E. (2011). Granularity in natural language discourse. Proceedings of International Conference on Computational Semantics.
Murphy, M. L. (2003). Semantic relations and the lexicon: Antonymy, synonymy, and other paradigms. Cambridge: Cambridge University Press.
Năstase, V., Nakov, P., Séaghdha, D. Ó., & Szpakowicz, S. (2013). Semantic relations between nominals. California: Morgan & Claypool Publishers.
Ohara, K. (2011). Full text annotation with Japanese FrameNet: Study to annotation semantic frame to BCCWJ (in Japanese). Proceedings of the 17th Annual Meeting of the Association for Natural Language Processing, pp. 703–704.
Pantel, P., Ravichandran, D., & Hovy, E. (2004). Towards terascale knowledge acquisition. Proceedings of COLING ’04.
Pasca, M., Lin, D., Bigham, J., Lifchits, A., & Jain, A. (2006). Names and similarities on the Web: Fact extraction in the fast lane. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 809–816, Sydney, Australia.
Postolache, O., Cristea, D., & Orasan, C. (2006). Transferring coreference chains through word alignment. Proceedings of LREC-2006, Geneva.
Quillian, M. R. (1962). A revised design for an understanding machine. Mechanical Translation, 7, 17–29.
Rao, D., McNamee, P., & Dredze, M. (2012). Entity linking: Finding extracted entities in a knowledge base. In T. Poibeau, H. Saggion, J. Piskorski, & R. Yangarber (Eds.), Multisource multilingual information extraction and summarization, Springer lecture notes in computer science. Berlin: Springer.
Rello, L., & Ilisei, I. (2009). A comparative study of Spanish zero
pronoun distribution. Proceedings of the International Symposium on Data and Sense Mining, Machine Translation and Controlled Languages (ISMTCL), pp. 209–214.
Rodríguez, K. J., Delogu, F., Versley, Y., Stemle, E. W., & Poesio, M. (2010). Anaphoric annotation of Wikipedia and blogs in the Live Memories corpus. Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC ’10).
Rosenfeld, B., & Feldman, R. (2007). Using corpus statistics on entities to improve semisupervised relation extraction from the Web. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 600–607, Prague, Czech Republic.
Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Chicago: University of Chicago Press.
Saggion, H. (2007). SHEF: Semantic tagging and summarization techniques applied to cross-document coreference. Proceedings of SemEval ’07.
Séaghdha, D. Ó., & Copestake, A. (2008). Semantic classification with distributional kernels. Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), Manchester, UK.
Singh, S., Subramanya, A., Pereira, F., & McCallum, A. (2011). Large-scale cross-document coreference using distributed inference and hierarchical models. Proceedings of HLT ’11, 1.
Simionescu, R. (2012). Romanian deep noun phrase chunking using graphical grammar studio. In M. A. Moruz, D. Cristea, D. Tufis, A. Iftene, & H. N. Teodorescu (Eds.), Proceedings of the 8th International Conference “Linguistic Resources and Tools for Processing of the Romanian Language”, pp. 135–143.
Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogeneous evidence. Proceedings of COLING-ACL ’06.
Tanaka, I. (1999). The value of an annotated corpus in the investigation of anaphoric pronouns, with particular reference to backwards anaphora in English. Ph.D. thesis, University of Lancaster.
Tesnière, L. (1959). Éléments de syntaxe structurale. Paris: Klincksieck.
Zock, M. (2010). Wheels for the mind of the language producer: microscopes, macroscopes, semantic maps and a good compass. In V. Barbu Mititelu, V. Pekar, & E. Barbu (Eds.), Proceedings of the Workshop Semantic Relations. Theory and Applications.
Zock, M., Ferret, O., & Schwab, D. (2010). Deliberate word access: An intuition, a roadmap and some preliminary empirical results. International Journal of Speech Technology, 13, 201–218.
Zock, M., & Schwab, D. (2013). L’index, une ressource vitale pour guider les auteurs à trouver le mot bloqué sur le bout de la langue. In N. Gala & M. Zock (Eds.), Ressources lexicales: construction et utilisation. Lingvisticae Investigationes. Amsterdam: John Benjamins.
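
As a supplementary illustration of the annotation density score defined in Sect. 5.2, the following Python sketch computes D = (E + 2R + 5(A + K + S))/N for each chunk and lists the chunks scoring below a threshold. It is only a sketch under stated assumptions, not the tooling used to build the corpus: the class and function names, the threshold value and the example counts are all hypothetical and are not taken from the corpus.

```python
from dataclasses import dataclass


@dataclass
class ChunkCounts:
    """Annotation counts for one chunk (all field names are hypothetical)."""
    entities: int      # E: number of marked entities
    referential: int   # R: number of marked REFERENTIAL relations
    affect: int        # A: number of marked AFFECT relations
    kinship: int       # K: number of marked KINSHIP relations
    social: int        # S: number of marked SOCIAL relations
    sentences: int     # N: number of sentences in the chunk


def density(c: ChunkCounts) -> float:
    """Density score D = (E + 2R + 5(A + K + S)) / N from Sect. 5.2."""
    if c.sentences <= 0:
        raise ValueError("a chunk must contain at least one sentence")
    weighted = c.entities + 2 * c.referential + 5 * (c.affect + c.kinship + c.social)
    return weighted / c.sentences


def low_density_chunks(chunks: dict[str, ChunkCounts], threshold: float) -> list[str]:
    """Return the names of chunks scoring below the threshold, lowest score first."""
    scored = {name: density(c) for name, c in chunks.items()}
    return sorted((n for n, d in scored.items() if d < threshold), key=scored.get)


# Made-up counts for illustration only (not taken from the corpus).
example = {
    "chunk_a": ChunkCounts(entities=100, referential=80, affect=2,
                           kinship=3, social=5, sentences=40),
    "chunk_b": ChunkCounts(entities=30, referential=10, affect=0,
                           kinship=1, social=0, sentences=50),
}

if __name__ == "__main__":
    for name, counts in example.items():
        print(name, round(density(counts), 2))
    print("to re-annotate:", low_density_chunks(example, threshold=5.5))
```

With these invented counts, chunk_a scores 7.75 and chunk_b scores 1.1, so only chunk_b would be flagged for a second annotation round. The threshold of 5.5 simply echoes the minimum score reported in the text and is an assumption about how such a cut-off might be chosen.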